Abstract

The Paraphrase Database (PPDB; Ganitkevitch et al., 2013) is an extensive semantic resource, consisting of a list of phrase pairs with (heuristic) confidence estimates. However, it is still unclear how it can best be used, due to the heuristic nature of the confidences and its necessarily incomplete coverage. We propose models that leverage the phrase pairs from the PPDB to build parametric paraphrase models that score paraphrase pairs more accurately than the PPDB's internal scores while simultaneously improving its coverage. They allow for learning phrase embeddings as well as improved word embeddings. Moreover, we introduce two new, manually annotated datasets to evaluate short-phrase paraphrasing models. Using our paraphrase model trained on PPDB, we achieve state-of-the-art results on standard word and bigram similarity tasks and beat strong baselines on our new short-phrase paraphrase tasks.¹


1 Introduction 


Paraphrase detection³ is the task of analyzing two segments of text and determining if they have the same meaning despite differences in structure and wording. It is useful for a variety of NLP tasks like question answering (Rinaldi et al., 2003; Fader et al., 2013), semantic parsing (Berant and Liang, 2014), textual entailment (Bosma and Callison-Burch, 2007), and machine translation (Marton et al., 2009).

1 We release our datasets, code, and trained models on the authors' websites.

2 This version differs from the previous one with the inclusion of Appendix A, which contains details about new higher dimensional embeddings we have released. These embeddings achieve human-level performance on SL999 and WS353.

3 See Androutsopoulos and Malakasiotis (2010) for a survey on approaches for detecting paraphrases.

One component of many such systems is a paraphrase table containing pairs of text snippets, usually automatically generated, that have the same meaning. The most recent work in this area is the Paraphrase Database (PPDB; Ganitkevitch et al., 2013), a collection of confidence-rated paraphrases created using the pivoting technique of Bannard and Callison-Burch (2005) over large parallel corpora. The PPDB is a massive resource, containing 220 million paraphrase pairs. It captures many short paraphrases that would be difficult to obtain using any other resource. For example, the pair {we must do our utmost, we must make every effort} has little lexical overlap but is present in PPDB. The PPDB has recently been used for monolingual alignment (Yao et al., 2013), for predicting sentence similarity (Bjerva et al., 2014), and to improve the coverage of FrameNet (Rastogi and Van Durme, 2014).


Though already effective for multiple NLP tasks, we note some drawbacks of PPDB. The first is lack of coverage: to use the PPDB to compare two phrases, both must be in the database. The second is that PPDB is a nonparametric paraphrase model; the number of parameters (phrase pairs) grows with the size of the dataset used to build it. In practice, it can become unwieldy to work with as the size of the database increases. A third concern is that the confidence estimates in PPDB are a heuristic combination of features, and their quality is unclear.


We address these issues in this work by introducing ways to use PPDB to construct parametric paraphrase models. First we show that initial skip-gram word vectors (Mikolov et al., 2013a) can be fine-tuned for the paraphrase task by training on word pairs from PPDB. We call them PARAGRAM word vectors. We find additive composition of PARAGRAM vectors to be a simple but effective way to embed phrases for short-phrase paraphrase tasks. We find improved performance by training a recursive neural network (RNN; Socher et al., 2010) directly on phrase pairs from PPDB.

We show that our resulting word and phrase representations are effective on a wide variety of tasks, including two new datasets that we introduce. The first, Annotated-PPDB, contains pairs from PPDB that were scored by human annotators. It can be used to evaluate paraphrase models for short phrases. We use it to show that the phrase embeddings produced by our methods are significantly more indicative of paraphrasability than the original heuristic scoring used by Ganitkevitch et al. (2013). Thus we use the power of PPDB to improve its contents.

Our second dataset, ML-Paraphrase, is a re-annotation of the bigram similarity corpus from Mitchell and Lapata (2010). The task was originally developed to measure semantic similarity of bigrams, but some annotations are not congruent with the functional similarity central to paraphrase relationships. Our re-annotation can be used to assess paraphrasing capability of bigram compositional models.

In summary, we make the following contributions:

Provide new PARAGRAM word vectors, learned using PPDB, that achieve state-of-the-art performance on the SimLex-999 lexical similarity task (Hill et al., 2014b) and lead to improved performance in sentiment analysis.

Provide ways to use PPDB to embed phrases. We compare additive and RNN composition of PARAGRAM vectors. Both can improve PPDB by re-ranking the paraphrases in PPDB to improve correlations with human judgments. They can be used as concise parameterizations of PPDB, thereby vastly increasing its coverage. We also perform a qualitative analysis of the differences between additive and RNN composition.

Introduce two new datasets. The first contains PPDB phrase pairs and evaluates how well models can measure the quality of short paraphrases. The second is a new annotation of the bigram similarity task in Mitchell and Lapata (2010) that makes it suitable for evaluating bigram paraphrases.

We release the new datasets, complete with annotation instructions and raw annotations, as well as our code and the trained models.⁴

4 Available on the authors' websites.

2 Related Work

There is a vast literature on representing words as vectors. The intuition of most methods to create these vectors (or embeddings) is that similar words have similar contexts (Firth, 1957). Earlier models made use of latent semantic analysis (LSA) (Deerwester et al., 1990). Recently, more sophisticated neural models, work originating with (Bengio et al., 2003), have been gaining popularity (Mikolov et al., 2013a; Pennington et al., 2014). These embeddings are now being used in new ways as they are being tailored to specific downstream tasks (Bansal et al., 2014).

Phrase representations can be created from word vectors using compositional models. Simple but effective compositional models were studied by Mitchell and Lapata (2008; 2010) and Blacoe and Lapata (2012). They compared a variety of binary operations on word vectors and found that simple point-wise multiplication of explicit vector representations performed very well. Other works like Zanzotto et al. (2010) and Baroni and Zamparelli (2010) also explored composition using models based on operations of vectors and matrices.

More recent work has shown that the extremely efficient neural embeddings of Mikolov et al. (2013a) also do well on compositional tasks simply by adding the word vectors (Mikolov et al., 2013b). Hashimoto et al. (2014) introduced an alternative word embedding and compositional model based on predicate-argument structures that does well on two simple composition tasks, including the one introduced by Mitchell and Lapata (2010).

An alternative approach to composition, used by Socher et al. (2011), is to train a recursive neural network (RNN) whose structure is defined by a binarized parse tree. In particular, they trained their RNN as an unsupervised autoencoder. The RNN captures the latent structure of composition. Recent work has shown that this model struggles in tasks involving compositionality (Blacoe and Lapata, 2012; Hashimoto et al., 2014).⁵ However, we found success using RNNs in a supervised setting, similar to Socher et al. (2014), who used RNNs to learn representations for image descriptions. The objective function we used in this work was motivated by their multimodal objective function for learning joint image-sentence representations.

Lastly, the PPDB has been used along with other resources to learn word embeddings for several tasks, including semantic similarity, language modeling, predicting human judgments, and classification (Yu and Dredze, 2014; Faruqui et al., 2015). Concurrently with our work, it has also been used to construct paraphrase models for short phrases (Yu and Dredze, 2015).


3 New Paraphrase Datasets 

We created two novel datasets: (1) Annotated-PPDB, a subset of phrase pairs from PPDB which are annotated according to how strongly they represent a paraphrase relationship, and (2) ML-Paraphrase, a re-annotation of the bigram similarity dataset from Mitchell and Lapata (2010), again annotated for strength of paraphrase relationship.


3.1 Annotated-PPDB 

Our motivation for creating Annotated-PPDB was to establish a way to evaluate compositional paraphrase models on short phrases. Most existing paraphrase tasks focus on words, like SimLex-999 (Hill et al., 2014b), or entire sentences, such as the Microsoft Research Paraphrase Corpus (Dolan et al., 2004; Quirk et al., 2004). To our knowledge, there are no datasets that focus on the paraphrasability of short phrases. Thus, we created Annotated-PPDB so that researchers can focus on local compositional phenomena and measure the performance of models directly, avoiding the need to do so indirectly in a sentence-level task. Models that have strong performance on Annotated-PPDB can be used to provide more accurate confidence scores for the paraphrases in the PPDB as well as reduce the need for large paraphrase tables altogether.


5 We also replicated this approach and found training to be 
time-consuming even using low-dimensional word vectors. 


Annotated-PPDB was created in a multi-step process (outlined below) involving various automatic filtering steps followed by crowdsourced human annotation. One of the aims for our dataset was to collect a variety of paraphrase types: we wanted to include pairs that were non-trivial to recognize as well as those with a range of similarity and length. We focused on phrase pairs with limited lexical overlap to avoid including those with only trivial differences.

We started with candidate phrases extracted from the first 10M pairs in the XXL version of the PPDB and then executed the following steps.⁶

Filter phrases for quality: Only those phrases whose tokens were in our vocabulary were retained.⁷ Next, all duplicate paraphrase pairs were removed; in PPDB, these are distinct pairs that contain the same two phrases with the order swapped.

Filter by lexical overlap: Next, we calculated the word overlap score in each phrase pair and then retained only those pairs that had a score of less than 0.5. By word overlap score, we mean the fraction of tokens in the smaller of the phrases with Levenshtein distance ≤ 1 to a token in the larger of the phrases. This was done to exclude less interesting phrase pairs like (my dad had, my father had) or (ballistic missiles, of ballistic missiles) that only differ in a synonym or the addition of a single word.

Select range of paraphrasabilities: To balance our dataset with both clear paraphrases and erroneous pairs in PPDB, we sampled 5,000 examples from each of ten chunks of the first 10M initial phrase pairs, where a chunk is defined as 1M phrase pairs.

Select range of phrase lengths: We then selected 1,500 phrases from each 5,000-example sample that encompassed a wide range of phrase lengths. To do this, we first binned the phrase pairs by their effective size. Let $n_1$ be the number of tokens of length greater than one character in the first phrase and $n_2$ the same for the second phrase. Then the effective size is defined as $\max(n_1, n_2)$. The bins contained pairs of effective size of 3, 4, and 5 or more, and 500 pairs were selected from each bin. This gave us a total of 15,000 phrase pairs.

6 Note that the confidence scores for phrase pairs in PPDB are based on a weighted combination of features with weights determined heuristically. The confidence scores were used to place the phrase pairs into their respective sets (S, M, L, XL, XXL, etc.), where each larger set subsumes all smaller ones.

7 Throughout, our vocabulary is defined as the most common 100K word types in English Wikipedia, following tokenization and lowercasing (see

Prune to 3,000: 3,000 phrase pairs were then selected randomly from the 15,000 remaining pairs to form an initial dataset, Annotated-PPDB-3K. The phrases were selected so that every phrase in the dataset was unique.

Annotate with Mechanical Turk: The dataset was then rated on a scale from 1-5 using Amazon Mechanical Turk, where a score of 5 denoted phrases that are equivalent in a large number of contexts, 3 meant that the phrases had some overlap in meaning, and 1 indicated that the phrases were dissimilar or contradictory in some way (e.g., can not adopt and is able to accept).

We only permitted workers whose location was in the United States and who had done at least 1,000 HITs with a 99% acceptance rate. Each example was labeled by 5 annotators and their scores were averaged to produce the final rating. Table 1 shows some statistics of the data. Overall, the annotated data had a mean deviation (MD)⁸ of 0.80. Table 1 shows that overall, workers found the phrases to be of high quality, as more than two-thirds of the pairs had an average score of at least 3. Also from the table, we can see that workers had stronger agreement on very low and very high quality pairs and were less certain in the middle of the range.

Prune to 1,260: To create our final dataset, Annotated-PPDB, we selected 1,260 phrase pairs from the 3,000 annotations. We did this by first binning the phrases into 3 categories: those with scores in the interval [1, 2.5), those with scores in the interval [2.5, 3.5], and those with scores in the interval (3.5, 5]. We took the 420 phrase pairs with the lowest MD in each bin, as these have the most agreement about their label, to form Annotated-PPDB.

These 1,260 examples were then randomly split 
into a development set of 260 examples and a test set 
of 1,000 examples. The development set had an MD 
of 0.61 and the test set had an MD of 0.60, indicating 
the final dataset had pairs of higher agreement than 
the initial 3,000. 
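The lexical-overlap filter and the effective-size binning described in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the function names are hypothetical, phrases are assumed to be non-empty and whitespace-tokenized, and the overlap threshold is taken to be a Levenshtein distance of at most 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (standard dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def word_overlap(phrase1: str, phrase2: str, max_dist: int = 1) -> float:
    """Fraction of tokens in the shorter phrase that are within
    `max_dist` edits of some token in the longer phrase."""
    t1, t2 = phrase1.split(), phrase2.split()
    small, large = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    matched = sum(
        1 for s in small if any(levenshtein(s, l) <= max_dist for l in large)
    )
    return matched / len(small)


def effective_size(phrase1: str, phrase2: str) -> int:
    """max(n1, n2), counting only tokens longer than one character."""
    n1 = sum(1 for tok in phrase1.split() if len(tok) > 1)
    n2 = sum(1 for tok in phrase2.split() if len(tok) > 1)
    return max(n1, n2)
```

Under these assumptions, (my dad had, my father had) scores 1.0 and is filtered out, while (we must do our utmost, we must make every effort) scores 0.4 and is retained.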


Score Range    MD      % of Data
[1,2)          0.66     8.1
[2,3)          1.05    20.0
[3,4)          0.93    34.9
[4,5]          0.59    36.9

Table 1: An analysis of Annotated-PPDB-3K extracted from PPDB. The statistics shown are for the splits of the data according to the average score by workers. MD denotes mean deviation and % of Data refers to the percentage of our dataset that fell into each range.
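To make the MD statistic and the final pruning step concrete, the sketch below computes a mean deviation per example and keeps the lowest-MD pairs in each score bin. It assumes MD is the mean absolute deviation of the annotator scores from their mean, which is one reading of the paper's description; the function names and the data layout are hypothetical.

```python
def mean_deviation(scores):
    """Mean absolute deviation of annotator scores from their mean
    (an assumed definition of MD)."""
    mean = sum(scores) / len(scores)
    return sum(abs(s - mean) for s in scores) / len(scores)


def score_bin(avg):
    """Bins used for the final pruning: [1, 2.5), [2.5, 3.5], (3.5, 5]."""
    if avg < 2.5:
        return "low"
    if avg <= 3.5:
        return "mid"
    return "high"


def prune(examples, per_bin):
    """examples: list of (pair_id, [annotator scores]).
    Returns, for each bin, the ids of the `per_bin` pairs with lowest MD."""
    bins = {"low": [], "mid": [], "high": []}
    for pair_id, scores in examples:
        avg = sum(scores) / len(scores)
        bins[score_bin(avg)].append((mean_deviation(scores), pair_id))
    return {
        name: [pair_id for _, pair_id in sorted(items)[:per_bin]]
        for name, items in bins.items()
    }
```

With `per_bin=420` and the 3,000 annotated pairs, this selection rule would yield the 1,260 examples described in the text.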


3.2 ML-Paraphrase 


Our second newly-annotated dataset, ML-Paraphrase, is based on the bigram similarity task originally introduced by Mitchell and Lapata (2010); we refer to the original annotations as the ML dataset.

The ML dataset consists of human similarity ratings for three types of bigrams: adjective-noun (JN), noun-noun (NN), and verb-noun (VN). Through manual inspection, we found that the annotations were not consistent with the notion of similarity central to paraphrase tasks. For instance, television set and television programme were the highest rated phrases in the NN section (based on average annotator score). Similarly, one of the highest ranked JN pairs was older man and elderly woman. This indicates that the annotations reflect topical similarity in addition to capturing functional or definitional similarity.

Therefore, we had the data re-annotated by two authors of this paper who are native English speakers.⁹ The bigrams were labeled on a scale from 1-5 where 5 denotes phrases that are equivalent in a large number of contexts, 3 indicates the phrases are roughly equivalent in a narrow set of contexts, and 1 means the phrases are not at all equivalent in any context. Following annotation, we collapsed the rating scale by merging 4s and 5s together and 1s and 2s together.

Statistics for the data are shown in Table 2. We show inter-annotator Spearman ρ and Cohen's κ in columns 2 and 3, indicating substantial agreement on the JN and VN portions but only moderate agreement on NN. In fact, when evaluating our NN annotations against those from the original ML data (column 4), we find ρ to be 0.38, well below the average human correlation of 0.49 (final column) reported by Mitchell and Lapata and also surpassed by pointwise multiplication (Mitchell and Lapata, 2010). This suggests that the original NN portion, more so than the others, favored a notion of similarity more related to association than paraphrase.

Data    IA ρ    IA κ    ML comp. ρ    ML Human ρ
JN      0.87    0.79    0.56          0.52
NN      0.64    0.58    0.38          0.49
VN      0.73    0.73    0.55          0.55

Table 2: Inter-annotator agreement of ML-Paraphrase and comparison with ML dataset. Columns 2 and 3 show the inter-annotator agreement between the two annotators measured with Spearman ρ and Cohen's κ. Column 4 shows the ρ between ML-Paraphrase and all of the ML dataset. The last column is the average human ρ on the ML dataset.

8 MD is similar to standard deviation, but uses absolute value instead of squared value and thus is both more intuitive and less sensitive to outliers.

9 We tried using Mechanical Turk here, but due to such short phrases, with few having the paraphrase relationship, workers did not perform well on the task.


4 Paraphrase Models 

We now present parametric paraphrase models and discuss training. Our goal is to embed phrases into a low-dimensional space such that cosine similarity in the space corresponds to the strength of the paraphrase relationship between phrases.

We use a recursive neural network (RNN) similar to that used by Socher et al. (2014). We first use a constituent parser to obtain a binarized parse of a phrase. For phrase $p$, we compute its vector $g(p)$ through recursive computation on the parse. That is, if phrase $p$ is the yield of a parent node in a parse tree, and phrases $c_1$ and $c_2$ are the yields of its two child nodes, we define $g(p)$ recursively as follows:

$$g(p) = f(W[g(c_1); g(c_2)] + b)$$

where $f$ is an element-wise activation function (tanh), $[g(c_1); g(c_2)] \in \mathbb{R}^{2n}$ is the concatenation of the child vectors, $W \in \mathbb{R}^{n \times 2n}$ is the composition matrix, $b \in \mathbb{R}^{n}$ is the offset, and $n$ is the dimensionality of the word embeddings. If node $p$ has no children (i.e., it is a single token), we define $g(p) = W_w^{(p)}$, where $W_w$ is the word embedding matrix in which particular word vectors are indexed using superscripts. The trainable parameters of the model are $W$, $b$, and $W_w$.
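As an illustration of the recursion, the following sketch evaluates $g(p)$ over a binarized tree using plain Python lists. It is a minimal sketch under stated assumptions: the tree encoding (a token string, or a 2-tuple of subtrees), the `embeddings` dictionary standing in for $W_w$, and the function name `compose` are all hypothetical, and no training is shown.

```python
import math


def compose(tree, W, b, embeddings):
    """Compute g(p) for a binarized parse.

    tree:       a token (str) or a pair (left_subtree, right_subtree)
    W:          n x 2n composition matrix, as a list of rows
    b:          offset vector of length n
    embeddings: dict mapping tokens to n-dimensional vectors (stands in for W_w)
    """
    if isinstance(tree, str):
        # Leaf: look up the word vector.
        return embeddings[tree]
    left, right = tree
    # Concatenate the two child vectors into one vector of length 2n.
    children = compose(left, W, b, embeddings) + compose(right, W, b, embeddings)
    # Apply the affine map followed by element-wise tanh.
    return [
        math.tanh(sum(w * c for w, c in zip(row, children)) + b_i)
        for row, b_i in zip(W, b)
    ]
```

For example, `compose((("town", "hall"), "meeting"), W, b, embeddings)` would first compose "town hall" and then combine the result with "meeting", always returning a vector of length $n$.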


4.1 Objective Functions 

We now present objective functions for training on pairs extracted from PPDB. The training data consists of (possibly noisy) pairs taken directly from the original PPDB. In subsequent sections, we discuss how we extract training pairs for particular tasks.

We assume our training data consists of a set $X$ of phrase pairs $\langle x_1, x_2 \rangle$, where $x_1$ and $x_2$ are assumed to be paraphrases. To learn the model parameters $(W, b, W_w)$, we minimize our objective function over the data using AdaGrad (Duchi et al., 2011) with mini-batches. The objective function follows:

$$\min_{W, b, W_w} \frac{1}{|X|} \Bigg( \sum_{\langle x_1, x_2 \rangle \in X} \max(0, \delta - g(x_1) \cdot g(x_2) + g(x_1) \cdot g(t_1)) + \max(0, \delta - g(x_1) \cdot g(x_2) + g(x_2) \cdot g(t_2)) \Bigg) + \lambda_W (\|W\|^2 + \|b\|^2) + \lambda_{W_w} \|W_{w_{initial}} - W_w\|^2 \qquad (1)$$

where $\lambda_W$ and $\lambda_{W_w}$ are regularization parameters, $W_{w_{initial}}$ is the initial word embedding matrix, $\delta$ is the margin (set to 1 in all of our experiments), and $t_1$ and $t_2$ are carefully-selected negative examples taken from a mini-batch during optimization.

The intuition for this objective is that we want the two phrases to be more similar to each other ($g(x_1) \cdot g(x_2)$) than either is to their respective negative examples $t_1$ and $t_2$, by a margin of at least $\delta$.
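The margin objective can be read off term by term. The sketch below evaluates it for phrase vectors that have already been computed; it is an illustrative reading of Eq. 1 rather than the authors' implementation, the function names are hypothetical, and gradients and AdaGrad updates are deliberately left out.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sq_norm(rows):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in rows for x in row)


def pair_loss(g_x1, g_x2, g_t1, g_t2, delta=1.0):
    """Two hinge terms for one paraphrase pair and its negative examples."""
    pos = dot(g_x1, g_x2)
    return (max(0.0, delta - pos + dot(g_x1, g_t1))
            + max(0.0, delta - pos + dot(g_x2, g_t2)))


def objective(pairs, W, b, Ww, Ww_init, lam_W, lam_Ww, delta=1.0):
    """pairs: list of (g_x1, g_x2, g_t1, g_t2) tuples of phrase vectors."""
    hinge = sum(pair_loss(*p, delta=delta) for p in pairs) / len(pairs)
    # Difference between the initial and current word embedding matrices.
    drift = [[a - c for a, c in zip(r0, r1)] for r0, r1 in zip(Ww_init, Ww)]
    return (hinge
            + lam_W * (sq_norm(W) + sq_norm([b]))
            + lam_Ww * sq_norm(drift))
```

A pair contributes zero loss only when the paraphrase dot product exceeds both negative dot products by at least the margin, which is the intuition stated above.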


Selecting Negative Examples To select $t_1$ and $t_2$ in Eq. 1, we simply chose the most similar phrase in the mini-batch (other than those in the given phrase pair). E.g., for choosing $t_1$ for a given $\langle x_1, x_2 \rangle$:

$$t_1 = \operatorname*{argmax}_{t : \langle t, \cdot \rangle \in X_b \setminus \{\langle x_1, x_2 \rangle\}} g(x_1) \cdot g(t)$$

where $X_b \subseteq X$ is the current mini-batch. That is, we want to choose a negative example $t_i$ that is similar to $x_i$ according to the current model parameters. The downside of this approach is that we may occasionally choose a phrase $t_i$ that is actually a true paraphrase of $x_i$. We also tried a strategy in which we selected the least similar phrase that would trigger an update (i.e., $g(t_i) \cdot g(x_i) > g(x_1) \cdot g(x_2) - \delta$), but we found the simpler strategy above to work better and used it for all experiments reported below.
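A small sketch of the simpler selection strategy follows. It assumes that the candidate negatives for a pair are all phrases occurring in the other pairs of the mini-batch (both sides of each pair), which is one reading of "the most similar phrase in the mini-batch"; the function name and the list-of-vector-pairs layout are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def select_negatives(batch):
    """batch: list of (g_x1, g_x2) vector pairs for the current mini-batch
    (at least two pairs). Returns one (g_t1, g_t2) tuple per pair, where each
    negative is the candidate with the highest dot product under the current
    vectors."""
    negatives = []
    for i, (v1, v2) in enumerate(batch):
        # Every phrase vector from every *other* pair is a candidate.
        candidates = [v for j, pair in enumerate(batch) if j != i for v in pair]
        t1 = max(candidates, key=lambda t: dot(v1, t))
        t2 = max(candidates, key=lambda t: dot(v2, t))
        negatives.append((t1, t2))
    return negatives
```

Because the search is restricted to the mini-batch, the cost per update is quadratic in the batch size rather than in the size of the training set.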















Discussion The objective in Eq. 1 is similar to one used by Socher et al. (2014), but with several differences. Their objective compared text and projected images. They also did not update the underlying word embeddings; we do so here, and in a way such that they are penalized from deviating from their initialization. Also for a given $\langle x_1, x_2 \rangle$, they do not select a single $t_1$ and $t_2$ as we do, but use the entire training set, which can be very expensive with a large training dataset.

We also experimented with a simpler objective that sought to directly minimize the squared L2-norm between $g(x_1)$ and $g(x_2)$ in each pair, along with the same regularization terms as in Eq. 1. One problem with this objective function is that the global minimum is 0 and is achieved simply by driving the parameters to 0. We obtained much better results using the objective in Eq. 1.


Training Word Paraphrase Models To train just word vectors on word paraphrase pairs (again from PPDB), we used the same objective function as above, but simply dropped the composition terms. This gave us an objective that bears some similarity to the skip-gram objective with negative sampling in word2vec (Mikolov et al., 2013a). Both seek to maximize the dot products of certain word pairs while minimizing the dot products of others. This objective function is:

$$\min_{W_w} \frac{1}{|X|} \Bigg( \sum_{\langle x_1, x_2 \rangle \in X} \max(0, \delta - W_w^{(x_1)} \cdot W_w^{(x_2)} + W_w^{(x_1)} \cdot W_w^{(t_1)}) + \max(0, \delta - W_w^{(x_1)} \cdot W_w^{(x_2)} + W_w^{(x_2)} \cdot W_w^{(t_2)}) \Bigg) + \lambda_{W_w} \|W_{w_{initial}} - W_w\|^2 \qquad (2)$$

It is like Eq. 1 except with word vectors replacing the RNN composition function and with the regularization terms on $W$ and $b$ removed.

We further found we could improve this model by incorporating constraints. From our training pairs, for a given word $w$, we assembled all other words that were paired with it in PPDB and all of their lemmas. These were then used as constraints during the pairing process: a word $t$ could only be paired with $w$ if it was not in its list of assembled words.

5 Experiments - Word Paraphrasing

We first present experiments on learning lexical paraphrasability. We train on word pairs from PPDB and evaluate on the SimLex-999 dataset (Hill et al., 2014b), achieving the best results reported to date.

5.1 Training Procedure

To learn word vectors that reflect paraphrasability, we optimized Eq. 2. There are many tunable hyperparameters with this objective, so to make training tractable we fixed the initial learning rates for the word embeddings to 0.5 and the margin $\delta$ to 1. Then we did a coarse grid search over a parameter space for $\lambda_{W_w}$ and the mini-batch size. We considered $\lambda_{W_w}$ values in $\{10^{-2}, 10^{-3}, \ldots, 10^{-7}, 0\}$ and mini-batch sizes in {100, 250, 500, 1000}. We trained for 20 epochs for each set of hyperparameters using AdaGrad (Duchi et al., 2011).

For all experiments, we initialized our word vectors with skip-gram vectors trained using word2vec (Mikolov et al., 2013a). The vectors were trained on English Wikipedia (tokenized and lowercased, yielding 1.8B tokens).¹⁰ We used a window size of 5 and a minimum count cut-off of 60, producing vectors for approximately 270K word types. We retained vectors for only the 100K most frequent words, averaging the rest to obtain a single vector for unknown words. We will refer to this set of the 100K most frequent words as our vocabulary.

5.2 Extracting Training Data

For training, we extracted word pairs from the lexical XL section of PPDB. We used the XL data for all experiments, including those for phrases. We used XL instead of XXL because XL has better quality overall while still being large enough so that we could be selective in choosing training pairs. There are a total of 548,085 pairs. We removed 174,766 that either contained numerical digits or words not in our vocabulary. We then removed 260,425 redundant pairs, leaving us with a final training set of 112,894 word pairs.
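The filtering of lexical pairs can be sketched as follows. This is a hypothetical illustration only: the paper does not define "redundant" precisely, so the sketch assumes it means exact repeats and order-swapped duplicates, and the function name and input format are made up for the example.

```python
def filter_word_pairs(pairs, vocabulary):
    """Keep word pairs that contain no digits, whose words are both in the
    vocabulary, and that have not been seen before in either order.

    pairs:      iterable of (word1, word2) tuples
    vocabulary: set of known word types
    """
    kept, seen = [], set()
    for w1, w2 in pairs:
        if any(ch.isdigit() for ch in w1 + w2):
            continue  # contains a numerical digit
        if w1 not in vocabulary or w2 not in vocabulary:
            continue  # out-of-vocabulary word
        key = frozenset((w1, w2))  # order-insensitive identity of the pair
        if key in seen:
            continue  # assumed notion of a redundant pair
        seen.add(key)
        kept.append((w1, w2))
    return kept
```

The counts reported above are consistent with two successive removals: 548,085 − 174,766 − 260,425 = 112,894.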


10 We used the December 2, 2013 snapshot. 














Model                        n       SL999 ρ
skip-gram                    25      0.21
skip-gram                    1000    0.38
PARAGRAM_WS                  25      0.56*
  + constraints              25      0.58*
Hill et al. (2014b)          200     0.446
Hill et al. (2014a)          -       0.52
inter-annotator agreement    N/A     0.67

Table 3: Results on the SimLex-999 (SL999) word similarity task obtained by performing hyperparameter tuning based on 2×WS-S − WS-R and treating SL999 as a held-out test set. n is word vector dimensionality. A * indicates statistical significance (p < 0.05) over the 1000-dimensional skip-gram vectors.
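The tuning criterion named in the caption, 2×WS-S − WS-R, combines two Spearman correlations between model similarities and human ratings. The sketch below shows one way to compute it with the standard library only; the function names are hypothetical, the inputs are assumed to be parallel lists of model scores and gold ratings with some variation in each, and ties are handled with average ranks.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def tuning_score(model_ws_s, gold_ws_s, model_ws_r, gold_ws_r):
    """Reward similarity (WS-S) twice and penalize relatedness (WS-R) once."""
    return 2 * spearman(model_ws_s, gold_ws_s) - spearman(model_ws_r, gold_ws_r)
```

A hyperparameter setting that tracks similarity judgments well but relatedness judgments poorly scores highest under this criterion, which matches the stated aim of targeting the paraphrase relationship.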


5.3 Tuning and Evaluation

Hyperparameters were tuned using the wordsim-353 (WS353) dataset (Finkelstein et al., 2001), specifically its similarity (WS-S) and relatedness (WS-R) partitions (Agirre et al., 2009). In particular, we tuned to maximize 2×WS-S correlation minus the WS-R correlation. The idea was to reward vectors with high similarity and relatively low relatedness, in order to target the paraphrase relationship.

After tuning, we evaluated the best hyperparameters on the SimLex-999 (SL999) dataset (Hill et al., 2014b). We chose SL999 as our primary test set as it most closely evaluates the paraphrase relationship. Even though WS-S is a close approximation to this relationship, it does not include pairs that are merely associated and assigned low scores, which SL999 does (see discussion in Hill et al., 2014b).

Note that for all experiments we used cosine similarity as our similarity metric and evaluated the statistical significance of dependent correlations using the one-tailed method of (Steiger, 1980).

5.4 Results

Table 3 shows results on SL999 when improving the initial word vectors by training on word pairs from PPDB, both with and without constraints. The "PARAGRAM_WS" rows show results when tuning to maximize 2×WS-S − WS-R. We also show results for strong skip-gram baselines and the best results from the literature, including the state-of-the-art results from Hill et al. (2014a) as well as the inter-annotator agreement from Hill et al. (2014b).¹¹

The table illustrates that, by training on PPDB, we can surpass the previous best correlations on SL999 by 4-6% absolute, achieving the best results reported to date. We also find that we can train low-dimensional word vectors that exceed the performance of much larger vectors. This is very useful as using large vectors can increase both time and memory consumption in NLP applications.

To generate word vectors to use for downstream applications, we chose hyperparameters so as to maximize performance on SL999.¹² These word vectors, which we refer to as PARAGRAM vectors, had a ρ of 0.57 on SL999. We use them as initial word vectors for the remainder of the paper.

5.5 Sentiment Analysis 


As an extrinsic evaluation of our PARAGRAM word vectors, we used them in a convolutional neural network (CNN) for sentiment analysis. We used the simple CNN from Kim (2014) and the binary sentence-level sentiment analysis task from Socher et al. (2013). We used the standard data splits, removing examples with a neutral rating. We trained on all constituents in the training set while only using full sentences from development and test, giving us train/development/test sizes of 67,349/872/1,821.


The CNN uses m-gram filters, each of which is an m × n vector. The CNN computes the inner product between an m-gram filter and each m-gram in an example, retaining the maximum match (so-called "max-pooling"). The score of the match is a single dimension in a feature vector for the example, which is then associated with a weight in a linear classifier used to predict positive or negative sentiment.
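The filter-and-pool computation described above can be sketched compactly. This is a simplified illustration, not Kim's implementation: filter biases, the nonlinearity, and dropout are omitted, each filter is assumed to be stored as a flat vector of length m·n, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def max_pool_features(sentence_vectors, filters, m=1):
    """One feature per filter: the best inner product between the filter and
    any m-gram window of the sentence.

    sentence_vectors: list of n-dimensional word vectors (length >= m)
    filters:          list of flat filters, each of length m * n
    """
    # Concatenate the vectors of each window of m consecutive words.
    windows = [
        sum(sentence_vectors[i:i + m], [])
        for i in range(len(sentence_vectors) - m + 1)
    ]
    return [max(dot(f, w) for w in windows) for f in filters]


def predict(features, weights, bias=0.0):
    """Linear classifier over the pooled features."""
    score = dot(features, weights) + bias
    return "positive" if score > 0 else "negative"
```

With `m=1` each filter is compared against individual word vectors, so a filter effectively marks a location in the embedding space and fires when some word in the sentence lies close to it.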

While Kim (2014) used m-gram filters of several lengths, we only used unigram filters. We also fixed the word vectors during learning (called "static" by Kim). After learning, the unigram filters correspond to locations in the fixed word vector space. The learned classifier weights represent how strongly each location corresponds to positive or negative sentiment. We expect this static CNN to be more effective if the word vector space separates positive and negative sentiment.

word vectors    n     accuracy (%)
skip-gram       25    77.0
skip-gram       50    79.6
PARAGRAM        25    80.9

Table 4: Test set accuracies when comparing embeddings in a static CNN on the binary sentiment analysis task from Socher et al. (2013).

11 Hill et al. (2014a) did not report the dimensionality of the vectors that led to their state-of-the-art results.

12 We did not use constraints during training.

In our experiments, we compared baseline skip-gram embeddings to our PARAGRAM vectors. We used an AdaGrad learning rate of 0.1, mini-batches of size 10, and a dropout rate of 0.5. We used 200 unigram filters and rectified linear units as the activation (applied to the filter output + filter bias). We trained for 30 epochs, predicting labels on the development set after each set of 3,000 examples. We recorded the highest development accuracy and used those parameters to predict labels on the test set.

Results are shown in Table 4. We see improvements over the baselines when using PARAGRAM vectors, even exceeding the performance of higher-dimensional skip-gram vectors.

6 Experiments - Compositional 
Paraphrasing 

In this section, we describe experiments on a variety 
of compositional phrase-based paraphrasing tasks. 
We start with the simplest case of bigrams, and then 
proceed to short phrases. For all tasks, we again 
train on appropriate data from PPDB and test on 
various evaluation datasets, including our two novel 
datasets (Annotated-PPDB and ML-Paraphrase). 

6.1 Training Procedure 

We trained our models by optimizing Eq. 1 using
AdaGrad (Duchi et al., 2011). We fixed the initial
learning rates to 0.5 for the word embeddings and
0.05 for the composition parameters, and the margin
to 1. Then we did a coarse grid search over a
parameter space for $\lambda_{W_w}$, $\lambda_W$, and
mini-batch size.

For $\lambda_{W_w}$, our search space again consisted
of $\{10^{-2}, 10^{-3}, \ldots, 10^{-7}, 0\}$, for
$\lambda_W$ it was $\{10^{-1}, 10^{-2}, 10^{-3}, 0\}$,
and we explored batch sizes of {100, 250, 500, 1000, 2000}.
When initializing with PARAGRAM vectors, the search
space for $\lambda_{W_w}$ was shifted upwards to be
$\{10, 1, 10^{-1}, 10^{-3}, \ldots, 10^{-6}\}$ to reflect
our increased confidence in the initial vectors. We trained
only for 5 epochs for each set of parameters. For
baselines, we used the same initial skip-gram
vectors as in Section 5.
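
The coarse grid search can be sketched as below. This is an illustrative outline only: `train_and_eval` is a hypothetical callback standing in for training a model for 5 epochs and returning its development-set correlation, and the dictionary keys are our own naming.

```python
from itertools import product

# Search spaces as listed in the text (non-PARAGRAM initialization).
LAMBDA_WW = [10.0 ** -k for k in range(2, 8)] + [0.0]  # 1e-2, ..., 1e-7, 0
LAMBDA_W = [1e-1, 1e-2, 1e-3, 0.0]
BATCH_SIZES = [100, 250, 500, 1000, 2000]


def grid():
    """Yield every combination of the three hyperparameters."""
    for lam_ww, lam_w, batch in product(LAMBDA_WW, LAMBDA_W, BATCH_SIZES):
        yield {"lambda_Ww": lam_ww, "lambda_W": lam_w, "batch_size": batch}


def coarse_grid_search(train_and_eval, configs):
    """Return the configuration with the highest development score."""
    best_cfg, best_score = None, float("-inf")
    for cfg in configs:
        score = train_and_eval(cfg)  # hypothetical: train, then score on dev
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

With these spaces the grid contains 7 x 4 x 5 = 140 configurations, which is one reason to limit each run to a small number of epochs.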


6.2 Evaluation and Baselines 


For all experiments, we again used cosine similarity
as our similarity metric and evaluated the statistical
significance using the method of (Steiger, 1980).

A baseline used in all compositional experiments
is vector addition of skip-gram (or PARAGRAM) word
vectors. Unlike explicit word vectors, where point-wise
multiplication acts as a conjunction of features and
performs well on composition tasks (Mitchell and
Lapata, 2008), using addition with skip-gram vectors
(Mikolov et al., 2013b) gives better performance than
multiplication.
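
The additive baseline amounts to summing the word vectors of each phrase and comparing the two sums with cosine similarity. A minimal sketch follows; the toy embeddings are invented for illustration and the function names are our own.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def add_compose(words: Sequence[str], embeddings: Dict[str, Vector]) -> Vector:
    """Compose a phrase by element-wise addition of its word vectors."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(embeddings[word]):
            total[i] += value
    return total


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 2-dimensional embeddings (illustrative values only).
toy = {
    "easy": [1.0, 0.0], "simple": [1.0, 0.0],
    "job": [0.0, 1.0], "task": [0.0, 1.0],
}
similarity = cosine(add_compose(["easy", "job"], toy),
                    add_compose(["simple", "task"], toy))
```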

6.3 Bigram Paraphrasability 

To evaluate our ability to paraphrase bigrams, we
consider the original bigram similarity task from
Mitchell and Lapata (2010) as well as our newly
annotated version of it: ML-Paraphrase.

Extracting Training Data Training data for these
tasks was extracted from the XL portion of PPDB. The
bigram similarity task from Mitchell and Lapata (2010)
contains three types of bigrams: adjective-noun (JN),
noun-noun (NN), and verb-noun (VN). We aimed to collect
pairs from PPDB that mirrored these three types of bigrams.

We found parsing to be unreliable on such short
segments of text, so we used a POS tagger (Manning
et al., 2014) to tag the tokens in each phrase. We then
used the word alignments in PPDB to extract bigrams for
training. For JN and NN, we extracted pairs containing
aligned, adjacent tokens in the two phrases with the
appropriate part-of-speech tag. Thus we extracted pairs
like (easy job, simple task) for the JN section and (town
meeting, town council) for the NN section. We used a
different strategy for extracting training data for the
VN subset: we took aligned VN tokens and took the
closest noun after the verb. This was done to approximate
the direct object that would have been ideally extracted
with a dependency parse. An example from this section is
(achieve goal, achieve aim).
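
One way to read the extraction procedure is sketched below. The data layout is an assumption on our part: each phrase is a list of tokens with parallel Penn Treebank tags, and the alignment is a one-to-one list of (index in phrase 1, index in phrase 2) pairs. The function names are hypothetical, and for VN we interpret "aligned VN tokens" as aligned verbs.

```python
from typing import List, Optional, Tuple

Bigram = Tuple[str, str]
PATTERNS = {"JN": ("JJ", "NN"), "NN": ("NN", "NN")}  # tag prefixes


def extract_adjacent(tokens1, tags1, tokens2, tags2, alignment,
                     kind: str) -> List[Tuple[Bigram, Bigram]]:
    """JN/NN: adjacent tokens aligned to adjacent tokens with matching tags."""
    first, second = PATTERNS[kind]
    align = dict(alignment)  # assumes a one-to-one alignment
    pairs = []
    for i in range(len(tokens1) - 1):
        j, j_next = align.get(i), align.get(i + 1)
        if j is None or j_next != j + 1:
            continue
        if (tags1[i].startswith(first) and tags1[i + 1].startswith(second)
                and tags2[j].startswith(first)
                and tags2[j + 1].startswith(second)):
            pairs.append(((tokens1[i], tokens1[i + 1]),
                          (tokens2[j], tokens2[j + 1])))
    return pairs


def closest_noun_after(tokens, tags, verb_index: int) -> Optional[str]:
    """Return the first noun to the right of the verb, if any."""
    for k in range(verb_index + 1, len(tokens)):
        if tags[k].startswith("NN"):
            return tokens[k]
    return None


def extract_verb_noun(tokens1, tags1, tokens2, tags2,
                      alignment) -> List[Tuple[Bigram, Bigram]]:
    """VN: aligned verbs, each paired with the closest following noun."""
    pairs = []
    for i, j in alignment:
        if tags1[i].startswith("VB") and tags2[j].startswith("VB"):
            noun1 = closest_noun_after(tokens1, tags1, i)
            noun2 = closest_noun_after(tokens2, tags2, j)
            if noun1 is not None and noun2 is not None:
                pairs.append(((tokens1[i], noun1), (tokens2[j], noun2)))
    return pairs
```

Taking the closest noun to the right of the verb is a heuristic stand-in for the direct object, so it can pick the wrong noun when other nominal material intervenes.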



















Model                                  Mitchell and Lapata (2010) Bigrams    ML-Paraphrase
word vectors                n   comp.  JN      NN     VN      Avg            JN      NN     VN      Avg
skip-gram                   25  +      0.36    0.44   0.36    0.39           0.32    0.35   0.42    0.36
PARAGRAM                    25  +      0.44*   0.34   0.48*   0.42           0.50*   0.29   0.58*‡  0.46
PARAGRAM                    25  RNN    0.51*†  0.40†  0.50*‡  0.47           0.57*‡  0.44†  0.55*   0.52
Hashimoto et al. (2014)                0.49    0.45   0.46    0.47           0.38    0.39   0.45    0.41
Mitchell and Lapata (2010)             0.46    0.49   0.38    0.44           -       -      -       -
Human                                  -       -      -       -              0.87    0.64   0.73    0.75


Table 5: Results on the test section of the bigram similarity task of Mitchell and Lapata (2010) and our newly annotated version
(ML-Paraphrase). (n) shows the word vector dimensionality and (“comp.”) shows the composition function used: “+” is vector
addition and “RNN” is the recursive neural network. The * indicates statistically significant (p < 0.05) over the skip-gram model,
† statistically significant over the {PARAGRAM, +} model, and ‡ statistically significant over Hashimoto et al. (2014).


We removed phrase pairs that (1) contained words
not in our vocabulary, (2) were redundant with others,
(3) contained brackets, or (4) had Levenshtein
distance < 1. The final criterion helps to ensure that
we train on phrase pairs with non-trivial differences.
The final training data consisted of 133,997 JN pairs,
62,640 VN pairs and 35,601 NN pairs.


Baselines In addition to RNN models, we report
baselines that use vector addition as the composition
function, both with our skip-gram embeddings and
PARAGRAM embeddings from Section 5.

We also compare to several results from prior
work. When doing so, we took their best correlations
for each data subset. That is, the JN and NN results
from Mitchell and Lapata (2010) use their multiplicative
model and the VN results use their dilation model.
From Hashimoto et al. (2014) we used their
PAS-CLBLM Add$_l$ and PAS-CLBLM Add$_{nl}$ models.
We note that their vector dimensionalities are larger
than ours, using n = 2000 and 50 respectively.


Results Results are shown in Table 5. We report
results on the test portion of the original
Mitchell and Lapata (2010) dataset (ML) as well as
the entirety of our newly-annotated dataset
(ML-Paraphrase). RNN results on ML were tuned on the
respective development sections and RNN results on
ML-Paraphrase were tuned on the entire ML dataset.

Our RNN model outperforms results from the literature
on most sections in both datasets and its average
correlations are among the highest.¹³ The one subset
of the data that posed difficulty was the NN section
of the ML dataset. We suspect this is due to the
reasons discussed in Section 3.2; for our ML-Paraphrase
dataset, by contrast, we do see gains on the NN section.

13 The results obtained here differ from those reported in
Hashimoto et al. (2014) as we scored their vectors with a
newer Python implementation of Spearman ρ that handles ties
(Hashimoto, P.C.).

We also outperform the strong baseline of adding
1000-dimensional skip-gram embeddings, a model
with 40 times the number of parameters, on our
ML-Paraphrase dataset. This baseline had correlations of
0.45, 0.43, and 0.47 on the JN, NN, and VN partitions,
with an average of 0.45, below the average ρ of the
RNN (0.52) and even the {PARAGRAM, +} model (0.46).

Interestingly, the type of vectors used to initialize
the RNN has a significant effect on performance.
If we initialize using the 25-dimensional skip-gram
vectors, the average ρ on ML-Paraphrase drops to
0.43, below even the {PARAGRAM, +} model.

6.4 Phrase Paraphrasability 

In this section we show that by training a
model based on filtered phrase pairs in PPDB,
we can actually distinguish between quality
paraphrases and poor paraphrases in PPDB better
than the original heuristic scoring scheme from
Ganitkevitch et al. (2013).

Extracting Training Data As before, training
data was extracted from the XL section of PPDB.
Similar to the procedure to create our Annotated-PPDB
dataset, phrases were filtered such that only
those with a word overlap score of less than 0.5
were kept. We also removed redundant phrases and
phrases that contained tokens not in our vocabulary.
The phrases were then binned according to their
effective size and 20,000 examples were selected from
bins of effective sizes of 3, 4, and more than 5,
creating a training set of 60,000 examples. Care was
taken to ensure that none of our training pairs was
also present in our development and test sets.


Baselines We compare our models with strong
lexical baselines. The first, strict word overlap, is
the percentage of words in the smaller phrase that
are also in the larger phrase. We also include a
version where the words are lemmatized prior to the
calculation.
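
The word overlap baseline can be sketched in a few lines. In this illustration, whitespace tokenization and the optional `normalize` callback are our assumptions; the callback stands in for whatever lemmatizer is applied in the lemmatized variant, and the toy lemma table below is invented.

```python
from typing import Callable, Optional


def word_overlap(phrase1: str, phrase2: str,
                 normalize: Optional[Callable[[str], str]] = None) -> float:
    """Fraction of tokens in the shorter phrase that occur in the longer one."""
    tokens1, tokens2 = phrase1.split(), phrase2.split()
    if normalize is not None:
        tokens1 = [normalize(t) for t in tokens1]
        tokens2 = [normalize(t) for t in tokens2]
    if len(tokens1) <= len(tokens2):
        smaller, larger = tokens1, tokens2
    else:
        smaller, larger = tokens2, tokens1
    if not smaller:
        return 0.0
    larger_set = set(larger)
    return sum(1 for t in smaller if t in larger_set) / len(smaller)


# Strict overlap, then a lemmatized variant with a toy lemma table.
strict = word_overlap("town meetings", "town meeting hall")
toy_lemmas = {"meetings": "meeting"}
lemmatized = word_overlap("town meetings", "town meeting hall",
                          normalize=lambda t: toy_lemmas.get(t, t))
```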

We also train a support vector regression model
(epsilon-SVR) (Chang and Lin, 2011) on the 33 features
that are included for each phrase pair in PPDB.
We scaled the features such that each lies in the
interval [-1, 1] and tuned the parameters using 5-fold
cross validation on our dev set.¹⁴ We then trained on
the entire dev set after finding the best performing
C and ε combination and evaluated on the test set of
Annotated-PPDB.


Model
word vectors                  n    comp.   Annotated-PPDB
skip-gram                     25   +       0.20
PARAGRAM                      25   +       0.32*
PARAGRAM                      25   RNN     0.40*†‡
Ganitkevitch et al. (2013)                 0.25
word overlap (strict)                      0.26
word overlap (lemmatized)                  0.20
PPDB+SVR                                   0.33


Table 6: Spearman correlation on Annotated-PPDB. The *
indicates statistically significant (p < 0.05) over the
skip-gram model, the † indicates statistically significant over
the {PARAGRAM, +} model, and the ‡ indicates statistically
significant over PPDB+SVR.


Results We evaluated on our Annotated-PPDB
dataset described in §3.1. Table 6 shows the Spearman
correlations on the 1000-example test set. RNN
models were tuned on the development set of 260
examples. All other methods had no hyperparameters
and therefore required no tuning.

We note that the confidence estimates from
Ganitkevitch et al. (2013) reach a ρ of 0.25 on the
test set, similar to the results of strict overlap. While
25-dimensional skip-gram embeddings only reach
0.20, we can improve this to 0.32 by fine-tuning
them using PPDB (thereby obtaining our PARAGRAM
vectors). By using the PARAGRAM vectors
to initialize the RNN, we reach a correlation of 0.40,
which is better than the PPDB confidence estimates
by 15% absolute.

14 We tuned both parameters over $\{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$.

We again consider addition of 1000-dimensional
skip-gram embeddings as a baseline, and they continue
to perform strongly (ρ = 0.37). The RNN initialized
with PARAGRAM vectors does reach a higher
ρ (0.40), but the difference is not statistically
significant (p = 0.16). Thus we can achieve similarly
strong results with far fewer parameters.

This task also illustrates the importance of
initializing our RNN model with appropriate word
embeddings. An RNN initialized with skip-gram vectors
has a modest ρ of 0.22, well below the ρ of the RNN
initialized with PARAGRAM vectors. Clearly,
initialization is important when optimizing non-convex
objectives like ours, but it is noteworthy that our best
results came from first improving the word vectors
and then learning the composition model, rather than
jointly learning both from scratch.

7 Qualitative Analysis 


Score Range    +       RNN
[1,2)          2.35    2.08
[2,3)          1.56    1.38
[3,4)          0.87    0.85
[4,5]          0.43    0.47


Table 7: Average absolute error of addition and RNN models
on different ranges of gold scores.


We performed a qualitative analysis to uncover
sources of error and determine differences between
adding PARAGRAM vectors and using an RNN
initialized with them. To do so, we took the output
of both systems on Annotated-PPDB and mapped
their cosine similarities to the interval [1,5]. We
then computed their absolute error as compared to
the gold ratings.

Table 7 shows how the average of these absolute
errors changes with the magnitude of the gold
ratings. The RNN performs better (has lower average
absolute error) for less similar pairs. Vector addition
only does better on the most similar pairs. This
is presumably because the most positive pairs have
high word overlap and so can be represented
effectively with a simpler model.
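
The error analysis behind Table 7 can be sketched as follows. The mapping from cosine similarities to [1,5] is not specified in the text, so the linear min-max rescaling used here is an assumption, as are the function names.

```python
from typing import Dict, List, Optional

BIN_LABELS = ["[1,2)", "[2,3)", "[3,4)", "[4,5]"]


def rescale_to_range(scores: List[float],
                     low: float = 1.0, high: float = 5.0) -> List[float]:
    """Assumed mapping: linearly rescale model scores onto [low, high]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(low + high) / 2.0 for _ in scores]
    return [low + (s - lo) * (high - low) / (hi - lo) for s in scores]


def mean_abs_error_by_bin(gold: List[float],
                          predicted: List[float]) -> Dict[str, Optional[float]]:
    """Average |gold - predicted| within each range of gold scores."""
    totals = [0.0] * len(BIN_LABELS)
    counts = [0] * len(BIN_LABELS)
    for g, p in zip(gold, predicted):
        index = min(int(g) - 1, len(BIN_LABELS) - 1)  # gold 5.0 joins [4,5]
        totals[index] += abs(g - p)
        counts[index] += 1
    return {
        label: (totals[i] / counts[i] if counts[i] else None)
        for i, label in enumerate(BIN_LABELS)
    }
```

Applied to the rescaled outputs of two models in turn, this yields one column of per-range errors per model.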




















Index  Phrase 1                  Phrase 2                     Length Ratio  Overlap Ratio  Gold  RNN  +
1      scheduled to be held in   that will take place in      1.0           0.4            4.6   2.9  4.4
2      according to the paper,   the newspaper reported that  0.8           0.5            4.6   2.8  4.1
3      at no cost to             without charge to            0.75          1.0            4.8   3.1  4.6
4      ’s surname                family name of               0.67          1.0            4.4   2.8  4.1
5      could have an impact on   may influence                0.4           0.5            4.6   4.2  3.2
6      to participate actively   to play an active role       0.6           0.67           5.0   4.8  4.0
7      earliest opportunity      early as possible            0.67          0.0            4.4   4.3  2.9
8      does not exceed           is no more than              0.75          0.0            5.0   4.8  3.5


Table 8: Illustrative phrase pairs from Annotated-PPDB with gold similarity > 4. The last three columns show the gold similarity
score, the similarity score of the RNN model, and the similarity score of vector addition. We note that addition performs better
when the pairs have high length ratio (rows 1-2) or overlap ratio (rows 3-4) while the RNN does better when those values are low
(rows 5-6 and 7-8 respectively). Boldface indicates smaller error compared to gold scores.


To further investigate the differences between
these models, we removed those pairs with gold
scores in [2,4], in order to focus on pairs with
extreme scores. We identified two factors that
distinguished the performance between the two models:
length ratio and the amount of lexical overlap.
We did not find evidence that non-compositional
phrases, such as idioms, were a source of error as
these were not found in ML-Paraphrase and only
appear rarely in Annotated-PPDB.

We define length ratio as simply the number of
tokens in the smaller phrase divided by the number
of tokens in the larger phrase. Overlap ratio is the
number of equivalent tokens in the phrase pair
divided by the number of tokens in the smaller of the
two phrases. Equivalent tokens are defined as tokens
that are either exact matches or are paired up in
the lexical portion of PPDB used to train the
PARAGRAM vectors.
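
These two definitions can be sketched directly. In this illustration the set `lexical_pairs` is a hypothetical stand-in for the lexical portion of PPDB, and we assume that a token of the smaller phrase counts as equivalent if it matches any token of the larger phrase.

```python
from typing import List, Set, Tuple


def length_ratio(tokens1: List[str], tokens2: List[str]) -> float:
    """Tokens in the smaller phrase divided by tokens in the larger phrase."""
    return min(len(tokens1), len(tokens2)) / max(len(tokens1), len(tokens2))


def overlap_ratio(tokens1: List[str], tokens2: List[str],
                  lexical_pairs: Set[Tuple[str, str]]) -> float:
    """Equivalent tokens divided by tokens in the smaller phrase."""
    if len(tokens1) <= len(tokens2):
        smaller, larger = tokens1, tokens2
    else:
        smaller, larger = tokens2, tokens1

    def equivalent(a: str, b: str) -> bool:
        # Exact match, or paired (in either order) in the lexical resource.
        return a == b or (a, b) in lexical_pairs or (b, a) in lexical_pairs

    matched = sum(1 for a in smaller if any(equivalent(a, b) for b in larger))
    return matched / len(smaller)


# Row 3 of Table 8, with an invented set of lexical pairs.
p1 = "at no cost to".split()
p2 = "without charge to".split()
pairs = {("cost", "charge"), ("no", "without")}
ratios = (length_ratio(p1, p2), overlap_ratio(p1, p2, pairs))
```

With these invented lexical pairs the sketch reproduces the length ratio of 0.75 and the overlap ratio of 1.0 listed for that row.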

Table 9 shows how the performance of the models
changes under different values of length ratio and
overlap ratio.¹⁵ The values in this table are the
percentage changes in absolute error when using the
RNN over the PARAGRAM vector addition model.
So negative values indicate superior performance by
the RNN.
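
The percentage change can be written out explicitly. As a sketch, assuming the change is normalized by the error of the addition model, and writing $\bar{e}_{+}$ and $\bar{e}_{\mathrm{RNN}}$ (our notation) for the average absolute errors of the two models on a given bin:

$$\Delta = 100 \times \frac{\bar{e}_{\mathrm{RNN}} - \bar{e}_{+}}{\bar{e}_{+}}$$

For instance, average absolute errors of 2.35 for addition and 2.08 for the RNN would give $\Delta = 100 \times (2.08 - 2.35)/2.35 \approx -11.5$, a negative value and hence a case where the RNN has the lower error.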

A few trends emerge from this table. One is that
as the length ratio increases (i.e., the phrase pairs
are closer in length), addition surpasses the RNN
for positive examples. For negative examples, the
trend is reversed. The same trend appears for overlap
ratio. Examples from Annotated-PPDB illustrating
these trends on positive examples are shown in
Table 8.

15 The bin delimiters were chosen to be uniform over the
range of output values of the length ratio ([0.4,1] with one
outlier data point removed) and overlap ratio ([0,1]).


Length Ratio         [0,0.6]     (0.6,0.8]    (0.8,1]
Positive Examples    -22.4       10.0         35.5
Negative Examples    -9.9        -11.1        -12.2
Both                 -13.0       -6.4         -2.0

Overlap Ratio        [0,1/3]     (1/3,2/3]    (2/3,1]
Positive Examples    -4.5        7.0          19.4
Negative Examples    -11.3       -7.5         -15.0
Both                 -10.6       -5.3         0.0


Table 9: Comparison of the addition and RNN model on phrase
pairs of different overlap and length ratios. The values in the
table are the percent change in absolute error from the addition
model to the RNN model. Negative examples are defined as
pairs from Annotated-PPDB whose gold score is less than 2 and
positive examples are those with scores greater than 4. “Both”
refers to both negative and positive examples.

When considering both positive and negative
examples (“Both”), we see that the RNN excels on the
most difficult examples (large differences in phrase
length and less lexical overlap). For easier examples,
the two fare similarly overall (-2.0 to 0.0%
change), but the RNN does much better on negative
examples. This aligns with the intuition that
addition should perform well when two paraphrastic
phrases have high lexical overlap and similar length.
But when they are not paraphrases, simple addition
is misled and the RNN’s learned composition function
better captures the relationship. This may suggest
new architectures for modeling compositionality
differently depending on differences in length
and amount of overlap.






























Model                         n     SL999    WS353    WS-S     WS-R
GloVe                         300   0.376    0.579    0.630    0.571
PARAGRAM$_{300,WS353}$        300   0.667    0.769    0.814    0.730
PARAGRAM$_{300,SL999}$        300   0.685    0.720    0.779    0.652
inter-annotator agreement*    N/A   0.67     0.756    N/A      N/A


Table 10: Evaluation of 300-dimensional PARAGRAM vectors on SL999 and WS353. Note that the inter-annotator agreement ρ
was calculated differently for WS353 and SL999. For SL999, the agreement was computed as the average pairwise correlation
between pairs of annotators, while for WS353, agreement was computed as the average correlation between a single annotator with
the average over all other annotators. If one uses the alternative measure of agreement for WS353, the agreement is 0.611, which is
easily beaten by automatic methods (Hill et al., 2014b).


Model                                 Mitchell and Lapata (2010) Bigrams    ML-Paraphrase
word vectors              n    comp.  JN     NN     VN     Avg              JN     NN     VN     Avg
GloVe                     300  +      0.40   0.46   0.37   0.41             0.39   0.36   0.45   0.40
PARAGRAM$_{300,WS353}$    300  +      0.52   0.41   0.49   0.48             0.55   0.42   0.55   0.51
PARAGRAM$_{300,SL999}$    300  +      0.51   0.36   0.51   0.46             0.57   0.39   0.59   0.52


Table 11: Evaluation of 300-dimensional PARAGRAM vectors on the bigram tasks.


8 Conclusion 


We have shown how to leverage PPDB to learn
state-of-the-art word embeddings and compositional
models for paraphrase tasks. Since PPDB was created
automatically from parallel corpora, our models
are also built automatically. Only small amounts of
annotated data are used to tune hyperparameters.

We also introduced two new datasets to evaluate
compositional models of short paraphrases, filling a
gap in the NLP community, as currently there are no
datasets created for this purpose. Successful models
on these datasets can then be used to extend the
coverage of, or provide an alternative to, PPDB.

There remains a great deal of work to be done
in developing new composition models, whether
with new network architectures or distance functions.
In this work, we based our composition function on
constituent parse trees, but this may not be the best
approach, especially for short phrases. Dependency
syntax may be a better alternative (Socher et al., 2014).
Besides improving composition, another direction to
explore is how to use models for short phrases in
sentence-level paraphrase recognition and other
downstream tasks.


Appendix A 


Model
word vectors                n     comp.   Annotated-PPDB
GloVe                       300   +       0.27
PARAGRAM$_{300,WS353}$      300   +       0.43
PARAGRAM$_{300,SL999}$      300   +       0.41


Table 12: Evaluation of 300-dimensional PARAGRAM vectors
on Annotated-PPDB.

Increasing the dimension of word embeddings or
training them on more data can have a significant
positive impact on many tasks, both at the
word level and on downstream tasks. We scaled
up our original 25-dimensional PARAGRAM embeddings
and modified our training procedure slightly in
order to produce two sets of 300-dimensional
PARAGRAM vectors.¹⁶ The vectors outperform our original
25-dimensional PARAGRAM vectors on all tasks
and achieve human-level performance on SL999 and
WS353. Moreover, when simply using vector addition
as a compositional model, they are both on
par with the RNN models we trained specifically for
each task. These results can be seen in Tables 10, 11,
and 12.

The main modification was to use higher-dimensional
initial embeddings, in our case the
pretrained 300-dimensional GloVe embeddings.¹⁷
Since PPDB only contains lowercased words, we
extracted only one GloVe vector per word type
(regardless of case) by taking the first occurrence of each
word in the vocabulary. This is the vector for the
most common casing of the word, and was used as
the word’s single initial vector in our experiments.
This reduced the vocabulary from the original 2.2
million types to 1.7 million.

16 Both PARAGRAM$_{300,WS353}$ and PARAGRAM$_{300,SL999}$
vectors can be found on the authors’ websites.

17 We used the GloVe vectors trained on 840 billion
tokens of Common Crawl data, available at
http://nlp.stanford.edu/projects/glove/
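
The case-folding step can be sketched as follows. The sketch assumes the pretrained vectors are read as (word, vector) entries in file order and that this order places more frequent casings first, which is what makes the first occurrence the most common casing; the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Vector = List[float]


def first_occurrence_lowercased(
        entries: Iterable[Tuple[str, Vector]]) -> Dict[str, Vector]:
    """Keep one vector per lowercased word type: the first one encountered."""
    table: Dict[str, Vector] = {}
    for word, vector in entries:
        key = word.lower()
        if key not in table:  # later casings of the same type are skipped
            table[key] = vector
    return table


# Toy entries in assumed frequency order (values are illustrative).
entries = [("the", [0.1]), ("Paris", [0.2]), ("The", [0.3]), ("paris", [0.4])]
vocab = first_occurrence_lowercased(entries)
```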

Smaller changes included replacing dot product
with cosine similarity in Equation 2 and a change to
the negative sampling procedure. We experimented
with three approaches: MAX sampling discussed in
Section 4.1, RAND sampling which is random sampling
from the batch, and a 50/50 mixture of MAX
sampling and RAND sampling.
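
The three sampling strategies can be sketched as below. Several details are assumptions on our part rather than statements from the text: MAX is taken to select the most similar word in the batch other than the word itself and its paraphrase partner, RAND excludes the same two words, and the 50/50 mixture is decided independently for each example.

```python
import math
import random
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0


def sample_negative(index: int, partner_index: int, vectors: List[Vector],
                    mode: str, rng: random.Random) -> int:
    """Pick the batch index of a negative example for vectors[index]."""
    candidates = [k for k in range(len(vectors))
                  if k not in (index, partner_index)]
    if not candidates:
        raise ValueError("batch too small to sample a negative example")
    if mode == "MIX":  # assumed: a fair coin flip per example
        mode = "MAX" if rng.random() < 0.5 else "RAND"
    if mode == "RAND":
        return rng.choice(candidates)
    if mode == "MAX":
        return max(candidates,
                   key=lambda k: cosine(vectors[index], vectors[k]))
    raise ValueError("unknown sampling mode: " + mode)
```

MAX tends to produce harder negatives than RAND, since it picks the word the current embeddings already consider closest.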

For training data, we selected all word pairs
in the lexical portion of PPDB XL that were in
our vocabulary, removing redundancies. This
resulted in 169,591 pairs for training. We trained
our models for 10 epochs and tuned hyperparameters
(batch size, $\lambda_{W_w}$, $\delta$, and sampling method)
in two ways: maximum correlation on WS353
(PARAGRAM$_{300,WS353}$) and maximum correlation
on SL999 (PARAGRAM$_{300,SL999}$). We report
results for both sets of embeddings in Tables 10, 11,
and 12 and make both available to the community
in the hope that they may be useful for other
downstream tasks.
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